The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
Authors of R packages to support Apache Spark, TensorFlow and MLflow. Contributors to tidyverse and Apache Arrow.
In an ideal world, all R packages work with Spark, like magic. Such is the case for dplyr and sparklyr.
library(sparklyr)
library(dplyr)
library(nycflights13)
# master can also point at a YARN, Mesos, or standalone Spark cluster, or a Livy endpoint
sc <- spark_connect(master = "local")
flights <- copy_to(sc, flights)
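As a minimal sketch of what "works with Spark" means for dplyr, the following applies ordinary dplyr verbs to a Spark copy of `flights`. sparklyr translates the verbs to Spark SQL, and `collect()` brings the small aggregated result back into R. It assumes a local Spark installation; the table name `flights_tbl` and the choice of summary are illustrative only.

```r
library(sparklyr)
library(dplyr)
library(nycflights13)

# Assumes Spark is installed locally (for example via spark_install())
sc <- spark_connect(master = "local")
flights_tbl <- copy_to(sc, flights, "flights", overwrite = TRUE)

# These dplyr verbs are translated to Spark SQL and executed inside Spark
delay_by_carrier <- flights_tbl %>%
  group_by(carrier) %>%
  summarise(mean_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  collect()  # bring the aggregated result back as a local tibble

delay_by_carrier
```

Only the aggregated rows cross from Spark into the R session, which is what keeps this pattern practical for tables too large to fit in local memory.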
sparkxgb is a new sparklyr extension that can be used to train XGBoost models in Spark.
library(sparkxgb)
iris <- copy_to(sc, iris)
xgb_model <- xgboost_classifier(iris, Species ~ ., num_class = 3, num_round = 50, max_depth = 4)
xgb_model %>% ml_predict(iris) %>%
  select(Species, predicted_label, starts_with("probability_")) %>% glimpse()
#> Observations: ??
#> Variables: 5
#> Database: spark_connection
#> $ Species                <chr> "setosa", "setosa", "setosa", "setosa", "…
#> $ predicted_label        <chr> "setosa", "setosa", "setosa", "setosa", "…
#> $ probability_versicolor <dbl> 0.003566429, 0.003564076, 0.003566429, 0.…
#> $ probability_virginica  <dbl> 0.001423170, 0.002082058, 0.001423170, 0.…
#> $ probability_setosa     <dbl> 0.9950104, 0.9943539, 0.9950104, 0.995010…
broom summarizes key information about models as data frames. The latest sparklyr release completes broom support for all modeling functions.
movies <- data.frame(user   = c(1, 2, 0, 1, 2, 0),
                     item   = c(1, 1, 1, 2, 2, 0),
                     rating = c(3, 1, 2, 4, 5, 4))
copy_to(sc, movies) %>%
  ml_als(rating ~ user + item) %>%
  augment()
# Source: spark<?> [?? x 4]
   user  item rating .prediction
  <dbl> <dbl>  <dbl>       <dbl>
1     2     2      5        4.86
2     1     2      4        3.98
3     0     0      4        3.88
4     2     1      1        1.08
5     0     1      2        2.00
6     1     1      3        2.80
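Besides `augment()`, broom also defines `tidy()` and `glance()`. The sketch below applies them to a Spark linear regression to show the same data-frame-centric workflow. The `mtcars` data and the formula are illustrative choices rather than part of the original example, and a local Spark installation is assumed.

```r
library(sparklyr)
library(dplyr)
library(broom)

# Assumes Spark is installed locally
sc <- spark_connect(master = "local")
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# Fit a linear regression in Spark
fit <- ml_linear_regression(mtcars_tbl, mpg ~ wt + cyl)

coefs   <- tidy(fit)    # one row per model term
summary <- glance(fit)  # one-row summary of fit statistics

coefs
summary
```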
sparktf is a new sparklyr extension that lets you write TensorFlow records (TFRecords) from Spark. This can be used to preprocess large amounts of data in Spark before training on GPU instances with Keras or TensorFlow.
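A minimal sketch of that workflow, assuming the sparktf package and a local Spark installation are available: the label column is indexed in Spark and the result is written out as TFRecords with `spark_write_tfrecord()`. The output directory is a hypothetical temporary path chosen for illustration.

```r
library(sparktf)   # load before connecting so its Spark dependencies are registered
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Hypothetical output location used only for this sketch
out_path <- file.path(tempdir(), "iris-tfrecord")

copy_to(sc, iris, "iris", overwrite = TRUE) %>%
  # Convert the string label to a numeric index before writing
  ft_string_indexer_model(
    "Species", "label",
    labels = c("setosa", "versicolor", "virginica")
  ) %>%
  spark_write_tfrecord(path = out_path)
```

The files written under `out_path` can then be read by a TensorFlow input pipeline on a separate machine, so the Spark cluster and the GPU instances do not need to share a runtime.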
VariantSpark is a framework based on Scala and Spark for analyzing genome datasets. It is being developed by the CSIRO Bioinformatics team in Australia. VariantSpark has been tested on datasets with 3,000 samples, each containing 80 million features, in both unsupervised clustering and supervised applications such as classification and regression.
library(sparklyr)
library(variantspark)
sc <- spark_connect(master = "local")
vsc <- vs_connect(sc)
hipster_vcf <- vs_read_vcf(vsc,
  "inst/extdata/hipster.vcf.bz2")
hipster_labels <- vs_read_csv(vsc,
  "inst/extdata/hipster_labels.txt")
labels <- vs_read_labels(vsc,
  "inst/extdata/hipster_labels.txt")
vs_importance_analysis(vsc, hipster_vcf, labels, n_trees = 100)
Hail is an open-source, general-purpose, Python-based data analysis tool with additional data types and methods for working with genomic data. Hail is built to scale and has first-class support for multi-dimensional structured data, like the genomic data in a genome-wide association study (GWAS).
The new github.com/r-spark organization supports the ecosystem of Spark and R extensions.
Spark NLP: state-of-the-art natural language processing. The first production-grade versions of the latest deep learning NLP research.